Closed Bug 1442893 Opened 7 years ago Closed 7 years ago

Intermittent talos Aborting task - max run time exceeded!

Categories

(Testing :: Talos, enhancement)

Version 3
enhancement
Not set
normal

Tracking

(firefox61 fixed)

RESOLVED FIXED
mozilla61
Tracking Status
firefox61 --- fixed

People

(Reporter: aryx, Assigned: jmaher)

References

Details

(Keywords: intermittent-failure, Whiteboard: [stockwell disabled])

Attachments

(1 file)

Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → DUPLICATE
Bug 1420394 is about reftests.
Status: RESOLVED → REOPENED
Resolution: DUPLICATE → ---
There are 69 failures in the past week. Platforms: most of the occurrences are on OS X 10.10 opt and debug, but we also have some on Windows 7 opt and pgo, windows2012-32 opt and windows10-64 pgo. Recent failure log: https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-central&job_id=173114870 :rwood Can you please take a look at this?
Flags: needinfo?(rwood)
Whiteboard: [stockwell needswork]
There are 62 failures in the last 7 days. They occur mostly on OS X 10.10, Windows 7, windows10-64, linux64-ccov, macosx64-nightly. The affected build types are: debug, opt, pgo.
Recent failure log: https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=174949491
Aborting task - max run time exceeded!
[taskcluster 2018-04-21T16:31:18.261Z] Exit Code: -1
[taskcluster 2018-04-21T16:31:18.261Z] === Task Finished ===
[taskcluster 2018-04-21T16:31:18.261Z] Task Duration: 29m58.755625774s
[taskcluster 2018-04-21T16:31:18.872Z] Uploading artifact public/logs/localconfig.json from file logs/localconfig.json with content encoding "gzip", mime type "application/json" and expiry 2019-04-21T15:31:41.513Z
[taskcluster 2018-04-21T16:31:19.368Z] Uploading artifact public/logs/talos_critical.log from file logs/talos_critical.log with content encoding "gzip", mime type "text/plain" and expiry 2019-04-21T15:31:41.513Z
[taskcluster 2018-04-21T16:31:19.833Z] Uploading artifact public/logs/talos_error.log from file logs/talos_error.log with content encoding "gzip", mime type "text/plain" and expiry 2019-04-21T15:31:41.513Z
[taskcluster 2018-04-21T16:31:20.286Z] Uploading artifact public/logs/talos_fatal.log from file logs/talos_fatal.log with content encoding "gzip", mime type "text/plain" and expiry 2019-04-21T15:31:41.513Z
[taskcluster 2018-04-21T16:31:20.662Z] Uploading artifact public/logs/talos_info.log from file logs/talos_info.log with content encoding "gzip", mime type "text/plain" and expiry 2019-04-21T15:31:41.513Z
[taskcluster 2018-04-21T16:31:21.275Z] Uploading artifact public/logs/talos_raw.log from file logs/talos_raw.log with content encoding "gzip", mime type "text/plain" and expiry 2019-04-21T15:31:41.513Z
[taskcluster 2018-04-21T16:31:21.844Z] Uploading artifact public/logs/talos_warning.log from file logs/talos_warning.log with content encoding "gzip", mime type "text/plain" and expiry 2019-04-21T15:31:41.513Z
[taskcluster 2018-04-21T16:31:22.207Z] Uploading artifact public/test_info/h1-e10s_errorsummary.log from file build/blobber_upload_dir/h1-e10s_errorsummary.log with content encoding "gzip", mime type "text/plain" and expiry 2019-04-21T15:31:41.513Z
[taskcluster 2018-04-21T16:31:22.578Z] Uploading artifact public/test_info/h1-e10s_raw.log from file build/blobber_upload_dir/h1-e10s_raw.log with content encoding "gzip", mime type "text/plain" and expiry 2019-04-21T15:31:41.513Z
[taskcluster 2018-04-21T16:31:22.668Z] Task not successful due to following exception(s):
[taskcluster 2018-04-21T16:31:22.668Z] Exception 1)
[taskcluster 2018-04-21T16:31:22.668Z] signal: killed
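For anyone triaging similar logs, the `Task Duration` value can be converted to seconds to see how close the task ran to its limit. This is a minimal sketch, not part of Talos or Taskcluster: the helper name is hypothetical, and the 30 minute limit is an assumption taken from the max runtime mentioned in this bug.

```python
import re

# Assumption: 30 minute max run time, as discussed in this bug.
MAX_RUN_TIME_SECONDS = 30 * 60


def parse_duration_seconds(text):
    """Parse a duration such as '29m58.755625774s' or '1h2m3s' into seconds.

    Hypothetical helper for illustration; it only handles h/m/s units.
    """
    match = re.fullmatch(
        r"(?:(\d+)h)?(?:(\d+)m)?(?:(\d+(?:\.\d+)?)s)?", text
    )
    if not text or match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    hours, minutes, seconds = match.groups()
    return int(hours or 0) * 3600 + int(minutes or 0) * 60 + float(seconds or 0)


duration = parse_duration_seconds("29m58.755625774s")
headroom = MAX_RUN_TIME_SECONDS - duration
print(f"ran {duration:.1f}s, {headroom:.2f}s short of the limit")
```

With the duration from the log above, the task stopped within about two seconds of the 30 minute mark, which is consistent with it being killed by the max run time rather than failing on its own.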
This seems to have slowed down; osx is our biggest problem. :rwood, when you look at this, let's just focus on the osx failures.
Noticed this happens quite often on ts_paint_heavy, here's a range: http://tinyurl.com/y82fsua6 And a failure log: https://treeherder.mozilla.org/logviewer.html#?job_id=175391012&repo=mozilla-inbound
Update: There have been 37 failures in the last 7 days. All failures occur on OS X 10.10 / opt with 1 exception for OS X 10.10 / debug. Recent log file: https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-central&job_id=176646993 Summary: Intermittent talos Aborting task - max run time exceeded!
This is really confusing: the h1 job (the large majority of failures) can often complete in 5 minutes, and then sometimes it takes longer. If we said the average runtime is 10 minutes, a 30 minute max runtime seems like plenty of time. Why is it that we run slower? Looking at the logs, it seems that we spend about 20 minutes unpacking the heavy profile; the jobs which pass used a cached version of the profile:
01:42:50 INFO - Initialising browser for ts_paint_heavy test...
01:42:55 INFO - Local copy of 'simple' is fresh enough
01:42:55 INFO - 3 days old
My conclusion is that we need to accept the fact that the average runtime is 5 minutes and that profile download+extraction will add an additional 10-30 minutes to the process. So should we adjust the maxruntime to 40 minutes?
Looking at the value we get from heavy vs plain: https://treeherder.mozilla.org/perf.html#/graphs?timerange=31536000&series=mozilla-inbound,1640692,1,1&series=mozilla-inbound,1640641,1,1
I see we post numbers ~5% higher for heavy, but the pattern is the same and noise levels are similar. In short, we are not seeing anything unique from a heavy profile on osx. I vote to disable this test on osx as we have data for win7/win10/linux64.
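The arithmetic behind the 30 vs 40 minute question can be sketched as follows. This is only an illustration using the rough figures from the comment above (about 5 minutes of test time, 10-30 minutes of profile download+extraction); the function names are hypothetical and none of this is Talos code.

```python
# Rough figures taken from the comment above; all values are in minutes.
TYPICAL_RUN = 5          # test run when the profile is already cached
UNPACK_RANGE = (10, 30)  # extra time for profile download + extraction


def runtime_range(run_minutes, unpack_range):
    """Return (best, worst) total runtime when the profile is not cached."""
    low, high = unpack_range
    return run_minutes + low, run_minutes + high


def fits(max_run_time, run_minutes, unpack_range):
    """True if the worst-case total stays within max_run_time."""
    _, worst = runtime_range(run_minutes, unpack_range)
    return worst <= max_run_time


print(runtime_range(TYPICAL_RUN, UNPACK_RANGE))  # (15, 35)
print(fits(30, TYPICAL_RUN, UNPACK_RANGE))       # False
print(fits(40, TYPICAL_RUN, UNPACK_RANGE))       # True
```

Under these assumed numbers, an uncached run can take up to 35 minutes, so a 30 minute limit is exceeded in the worst case while 40 minutes would cover it.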
We could either:
1) extend the timeout
2) do #1 and restrict to try/m-c
3) disable it
I chose 3 to reduce confusion and to focus our efforts on tests that provide value. If you would prefer, I could delete the line instead of commenting it out.
Assignee: nobody → jmaher
Status: REOPENED → ASSIGNED
Flags: needinfo?(rwood)
Attachment #8972815 - Flags: review?(rwood)
Comment on attachment 8972815: disable h1 on osx
Review of attachment 8972815:
-----------------------------------------------------------------
I completely agree; this test has always had issues unpacking the heavy profile. It is also still running on Windows and Linux anyway. I vote to just remove it permanently.
Attachment #8972815 - Flags: review?(rwood) → review+
Pushed by jmaher@mozilla.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/64703ce328ea disable ts_paint_heavy on osx due to length of time to unpack profile. r=rwood
Whiteboard: [stockwell disable-recommended] → [stockwell disabled]
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla61
Depends on: 1459393
Depends on: 1460726
No longer depends on: 1460726
This doesn't seem to be fixed. There are 35 failures in the last 7 days. Last failure in OF 11 May 2018, 05:36: https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=178002803
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
this is mostly:
* osx tp6 (I think a bad machine)
* test-verify on android
* osx-qr reftests
Setting this to fixed and telling people to use bug 1439979 for new occurrences.
Status: REOPENED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED